Potential-based Shaping in Model-based Reinforcement Learning

Authors

  • John Asmuth
  • Michael L. Littman
  • Robert Zinkov
Abstract

Potential-based shaping was designed as a way of introducing background knowledge into model-free reinforcement-learning algorithms. By identifying states that are likely to have high value, this approach can decrease experience complexity—the number of trials needed to find near-optimal behavior. An orthogonal way of decreasing experience complexity is to use a model-based learning approach, building and exploiting an explicit transition model. In this paper, we show how potential-based shaping can be redefined to work in the model-based setting to produce an algorithm that shares the benefits of both ideas.

Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Introduction

Like reinforcement learning, the term shaping comes from the animal-learning literature. In training animals, shaping is the idea of directly rewarding behaviors that are needed for the main task being learned. Similarly, in reinforcement-learning parlance, shaping means introducing “hints” about optimal behavior into the reward function so as to accelerate the learning process. This paper examines potential-based shaping functions, which introduce artificial rewards in a particular form that is guaranteed to leave the optimal behavior unchanged yet can influence the agent’s exploration behavior to decrease the time spent trying suboptimal actions. It addresses how shaping functions can be used with model-based learning, specifically the Rmax algorithm (Brafman & Tennenholtz 2002).

The paper makes three main contributions. First, it demonstrates a concrete way of using shaping functions with model-based learning algorithms and relates it to model-free shaping and the Rmax algorithm. Second, it argues that “admissible” shaping functions are of particular interest because they do not interfere with Rmax’s ability to identify near-optimal policies quickly. Third, it presents computational experiments that show how model-based learning, shaping, and their combination can speed up learning.

The next section provides background on reinforcement-learning algorithms. The following one defines different notions of return as a way of showing the effect of different ways of incorporating shaping into both model-based and model-free algorithms. Next, admissible shaping functions are defined and analyzed. Finally, the experimental section applies both model-based and model-free algorithms, in both shaped and unshaped forms, to two simple benchmark problems to illustrate the benefits of combining shaping with model-based learning.

Background

We use the following notation. A Markov decision process (MDP) (Puterman 1994) is defined by a set of states S, a set of actions A, a reward function R : S × A → ℝ, a transition function T : S × A → Π(S) (where Π(·) denotes a discrete probability distribution over the given set), and a discount factor 0 ≤ γ ≤ 1 for downweighting future rewards. If γ = 1, we assume all non-terminal rewards are non-positive and that there is a zero-reward absorbing state (goal) that can be reached from all other states. Agents attempt to maximize cumulative expected discounted reward (expected return). We assume that all reward values are bounded above by the value Rmax. We define vmax to be the largest expected return attainable from any state—if it is unknown, it can be derived as vmax = Rmax/(1 − γ) if γ < 1 and vmax = Rmax if γ = 1 (because of the assumptions described above). An MDP can be solved to find an optimal policy—a way of choosing actions in states to maximize expected return.
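To make this notation concrete, the following is a minimal tabular-MDP container in Python, including the vmax bound just derived. It is a sketch with our own class and field names, not code from the paper, and it takes the largest reward in the table as Rmax.

```python
import numpy as np

class TabularMDP:
    """Minimal tabular MDP: states 0..|S|-1, actions 0..|A|-1 (illustrative sketch)."""

    def __init__(self, T, R, gamma):
        self.T = np.asarray(T)      # T[s, a, s'] = transition probability
        self.R = np.asarray(R)      # R[s, a]     = expected immediate reward
        self.gamma = gamma          # discount factor, 0 <= gamma <= 1
        self.n_states, self.n_actions = self.R.shape

    def v_max(self):
        """Upper bound on expected return, as derived in the text."""
        r_max = self.R.max()        # stand-in for the reward bound Rmax
        if self.gamma < 1:
            return r_max / (1.0 - self.gamma)
        # gamma == 1: non-positive non-terminal rewards plus a zero-reward goal
        return r_max
```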
For any state s ∈ S, it is optimal to choose any a ∈ A that maximizes Q(s, a), defined by the simultaneous equations

Q(s, a) = R(s, a) + γ ∑_{s′} T(s, a, s′) max_{a′} Q(s′, a′).

In the reinforcement-learning (RL) setting, an agent interacts with the MDP, taking actions and observing state transitions and rewards. Two well-studied families of RL algorithms are model-free and model-based algorithms. Although the distinction can be fuzzy, in this paper we examine classic examples of these families—Q-learning (Watkins & Dayan 1992) and Rmax (Brafman & Tennenholtz 2002). An RL algorithm takes experience tuples as input and produces action decisions as output. An experience tuple ⟨s, a, s′, r⟩ represents a single transition from state s to s′ under the influence of action a. The value r is the immediate reward received on this transition. We next describe two concrete RL algorithms.

Q-learning is a model-free algorithm that maintains an estimate Q̂ of the Q function Q. Starting from some initial values, each experience tuple ⟨s, a, s′, r⟩ changes the Q-function estimate via

Q̂(s, a) ←_α r + γ max_{a′} Q̂(s′, a′),

where α is a learning rate and “x ←_α y” means that x is assigned the value (1 − α)x + αy. The Q-function estimate is used to make an action decision in state s ∈ S, usually by choosing the a ∈ A such that Q̂(s, a) is maximized. Q-learning is actually a family of algorithms. A complete Q-learning specification includes rules for initializing Q̂, for changing α over time, and for selecting actions. The primary theoretical guarantee provided by Q-learning is that Q̂ converges to Q in the limit of infinite experience, provided α is decreased at the right rate and every s, a pair begins infinitely many experience tuples (that is, each pair is tried infinitely often).

A prototypical model-based algorithm is Rmax, which creates “optimistic” estimates T̂ and R̂ of T and R. Specifically, Rmax hypothesizes a state smax such that T̂(smax, a, smax) = 1 and R̂(smax, a) = 0 for all a. It also initializes T̂(s, a, smax) = 1 and R̂(s, a) = vmax for all s and a. (Unmentioned transitions have probability 0.) It keeps a transition count c(s, a, s′) and a reward total t(s, a), both initialized to zero. In its practical form, it also has an integer parameter m ≥ 0. Each experience tuple ⟨s, a, s′, r⟩ results in the updates c(s, a, s′) ← c(s, a, s′) + 1 and t(s, a) ← t(s, a) + r. Then, if ∑_{s′} c(s, a, s′) = m, the algorithm sets T̂(s, a, s′) = c(s, a, s′)/m and R̂(s, a) = t(s, a)/m. At this point, the algorithm computes

Q̂(s, a) = R̂(s, a) + γ ∑_{s′} T̂(s, a, s′) max_{a′} Q̂(s′, a′)

for all state–action pairs. When in state s, the algorithm chooses the action that maximizes Q̂(s, a). Rmax guarantees, with high probability, that it will take near-optimal actions on all but a polynomial number of steps if m is set sufficiently high (Kakade 2003). We refer to any state–action pair s, a such that ∑_{s′} c(s, a, s′) < m as unknown and otherwise known.

Notice that, each time a state–action pair becomes known, Rmax performs a great deal of computation, solving an approximate MDP model. In contrast, Q-learning consistently has low per-experience computational complexity. The principal advantage of Rmax, then, is that it makes more efficient use of the experience it gathers, at the cost of increased computational complexity. Shaping and model-based learning have both been shown to decrease experience complexity compared to a straightforward implementation of Q-learning.
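The two update rules just described translate directly into a few lines of Python. The sketch below uses our own naming (dictionaries keyed by state–action pairs) and is not the authors' implementation; in particular, the planning step Rmax performs when a pair becomes known is only indicated by a comment.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, s_next, r, alpha, gamma, actions):
    """One Q-learning step: Q(s,a) <-_alpha  r + gamma * max_a' Q(s',a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target

class RmaxCounts:
    """Bookkeeping for Rmax's unknown/known state-action pairs (illustrative sketch)."""

    def __init__(self, m):
        self.m = m                      # experience threshold for "known"
        self.c = defaultdict(int)       # c[(s, a, s')]: transition counts
        self.t = defaultdict(float)     # t[(s, a)]:     reward totals
        self.n = defaultdict(int)       # n[(s, a)] = sum over s' of c[(s, a, s')]
        self.T_hat = {}                 # learned transition estimates
        self.R_hat = {}                 # learned reward estimates

    def update(self, s, a, s_next, r):
        """Record one experience tuple <s, a, s', r>; return True if (s, a) just became known."""
        if self.n[(s, a)] >= self.m:
            return False                # already known; further data is ignored
        self.c[(s, a, s_next)] += 1
        self.t[(s, a)] += r
        self.n[(s, a)] += 1
        if self.n[(s, a)] == self.m:
            # (s, a) just became known: freeze the empirical estimates.
            self.R_hat[(s, a)] = self.t[(s, a)] / self.m
            for (s2, a2, s3), count in self.c.items():
                if (s2, a2) == (s, a):
                    self.T_hat[(s, a, s3)] = count / self.m
            # Here Rmax would re-solve the optimistic model (e.g., by value
            # iteration over T_hat, R_hat) to refresh Q-hat; omitted in this sketch.
            return True
        return False
```

In use, Q would be a collections.defaultdict(float) over (state, action) pairs, and any pair that has not yet reached the threshold m keeps Rmax's optimistic initialization (T̂(s, a, smax) = 1, R̂(s, a) = vmax).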
In this paper, we show that the benefits of these two ideas are orthogonal—their combination is more effective than either approach alone. The next section formalizes the effect of potential-based shaping in both model-free and model-based settings.

Comparing Definitions of Return

We next discuss several notions of how a trajectory is summarized by a single value, the return. We address the original notion of return, shaped return, Rmax return, and finally shaped Rmax return.

Original Return

Let’s consider an infinite sequence of states, actions, and rewards: s̄ = s_0, a_0, r_0, …, s_t, a_t, r_t, …. In the discounted framework, the return for this sequence is taken to be

U(s̄) = ∑_{i=0}^{∞} γ^i r_i.

Note that the Q function for any policy can be decomposed into an average (expectation) of such returns over trajectories.
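The original return just defined, together with the standard potential-based bonus γΦ(s′) − Φ(s) (the “particular form” the introduction refers to), can be illustrated with a short sketch. The function names, example rewards, and example potential Φ below are our own and are not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """U(s_bar) = sum_i gamma^i * r_i, evaluated on a finite trajectory prefix."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

def shaped_rewards(states, rewards, phi, gamma):
    """Add the potential-based bonus gamma*phi(s') - phi(s) to each reward.

    states[i] is the state in which rewards[i] was received and states[i + 1]
    is its successor (naming is ours, for illustration only).
    """
    return [r + gamma * phi(states[i + 1]) - phi(states[i])
            for i, r in enumerate(rewards)]

# Tiny example: the potential terms telescope, so with a zero-potential goal the
# shaped return equals the original return minus phi(s0).
gamma = 0.9
states = ["s0", "s1", "s2", "goal"]
rewards = [0.0, 0.0, 1.0]
phi = {"s0": 0.2, "s1": 0.5, "s2": 0.9, "goal": 0.0}.get
print(discounted_return(rewards, gamma))                                      # ≈ 0.81
print(discounted_return(shaped_rewards(states, rewards, phi, gamma), gamma))  # ≈ 0.61
```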


